Regularized Raking

for Better Survey Estimates

Andy Timm

February 8, 2026

Introduction

Raking is among the most commonly used algorithms for building survey weights so that an unrepresentative sample can be used to make inferences about the general population.

Regularized Raking extends this framework, allowing more explicit and granular tradeoffs among the properties of the resulting set of weights.

Goals:

  1. Review vanilla raking, and clarify in what sense the weights it produces are ‘optimal’.
  2. Argue that vanilla raking’s implicit objective isn’t a great fit for modern survey inference, and show why Regularized Raking is a good framework for improvements.

Outline

  • Quick about me (& my weights!)
  • Raking: easy mode (introduce/review vanilla raking)
  • Raking: hard mode (when raking gets confusing)
  • Introduce regularized raking
  • Examples where RR really helps
  • RR in the survey weights cinematic universe (design vs. model based inference, comparisons to MRP, etc)

About me (& my weights)

  • Data scientist working in politics and advertising; both have given me a lot of cause to think hard about weights.

Some recent types of weights I’ve worked on:

  • TV viewership weights (weight ~41M TV viewership panel to US TV viewing gen pop)
  • Political surveys (Polls, message tests, microtargeting surveys…)
  • COVID Vaccine Message Testing (multi-arm RCT to convince people to get vaccine, weight to customers of our pharmacy partner)
  • + More (Market Research, field experiments in politics, RCTs in advertising with survey outcomes…)

Raking (Easy Mode)

Let’s introduce/review raking!

We have a non-representative sample from a larger universe, which we want to use to make inferences about that universe. How do we best assign weights to each observation in the sample to make (weighted) estimates representative?
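As a toy illustration of the goal (all values below are made up): a weighted estimate simply scales each observation's contribution by its weight.

```r
# Toy example: a weighted mean scales each respondent's outcome by their weight
y <- c(1, 0, 1, 1)      # outcomes (e.g. 1 = enjoys pineapple on pizza)
w <- c(0.5, 2, 1, 0.5)  # hypothetical weights: underrepresented folks count more
weighted.mean(y, w)     # sum(w * y) / sum(w) = 0.5
```

The whole game is choosing `w` so that this weighted estimate looks like the universe, not the sample.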

In Easy Mode:

  • Our outcome \(Y^N\) is correlated with demographics (even in easy mode)
  • We’ll have Ignorable Non-Response (conditional on weighting vars)
  • More specifically, our sample happens to only need adjustment on age and race to match gen pop on this outcome
  • We somehow know everything we need to weight on to estimate \(Y^N\)
  • We also somehow know age and race proportions of the sample and universe completely without error

Raking

Raking is an iterative algorithm for building sample weights such that (weighted) sample marginal distributions match population marginal distributions.

We’ll generate a synthetic universe, sample in biased fashion from it, then rake it by hand and with autumn in R.

Generating Data

library(tidyverse)

n_universe <- 100000
n_sample <- 1000

# Universe Data Generating Process
universe <- tibble(
  age = sample(c("young", "old"),
               n_universe, replace = TRUE,
               prob = c(0.5, 0.5)),
  race = sample(c("white", "black", "asian", "hispanic", "other"),
                n_universe, replace = TRUE,
                prob = c(0.6, 0.1, 0.1, 0.1, 0.1)))

# Defining the outcome (probability of enjoying pineapple on pizza)
universe <- universe %>%
  mutate(enjoys_pineapple = plogis(rnorm(n_universe,-1,.5) +  #bias term
                              1*ifelse(race == "white", 1, 0) + # More likely if white
                              .5*ifelse(age == "old", 1, 0))) # More likely if older

Check distribution

universe %>% summarize(mean_p = mean(enjoys_pineapple))
# A tibble: 1 × 1
  mean_p
   <dbl>
1  0.467
universe %>%
  group_by(age,race) %>%
  summarize(mean_p = mean(enjoys_pineapple)) %>%
  arrange(mean_p)
# A tibble: 10 × 3
# Groups:   age [2]
   age   race     mean_p
   <chr> <chr>     <dbl>
 1 young asian     0.277
 2 young black     0.279
 3 young other     0.281
 4 young hispanic  0.282
 5 old   other     0.382
 6 old   asian     0.383
 7 old   hispanic  0.385
 8 old   black     0.386
 9 young white     0.499
10 old   white     0.616

Sample with Bias

# Parameters for the sample distribution
age_weights <- c(young = 1, old = 2)  # Old twice as likely
race_weights <- c(white = 5, black = 2,
                  asian = .5, hispanic = 1, other = 3)  # Also Wrong

universe <- universe %>%
  mutate(
    age_weight = age_weights[age],
    race_weight = race_weights[race],
    sample_weight = age_weight * race_weight
  )

# Draw the sample
sample_data <- universe %>%
  sample_n(size = n_sample, weight = sample_weight)

# Not our mean!
sample_data %>% summarize(mean_p = mean(enjoys_pineapple))
# A tibble: 1 × 1
  mean_p
   <dbl>
1  0.537

Shake and Rake

Raking proceeds variable by variable, adjusting weights to fix bias on one marginal at a time, iterating until convergence.

# 1. Determine the Target Distribution for Age in the Universe
age_distribution_universe <- universe %>%
  count(age) %>%
  mutate(proportion = n / sum(n))

# 2. Calculate Initial Sample Weights (assuming equal probability of selection initially)
sample_data <- sample_data %>%
  mutate(weight = 1)

Shake and Rake

# 3. Raking Step: Adjust Weights for Age
# a. Calculate the distribution of age in the sample
age_distribution_sample <- sample_data %>%
  count(age) %>%
  mutate(proportion = n / sum(n))

# b. Join with the target distribution to get the adjustment factor
adjustment_factors <- age_distribution_sample %>%
  left_join(age_distribution_universe, by = "age", suffix = c("_sample", "_universe")) %>%
  mutate(adjustment_factor = proportion_universe / proportion_sample)

# c. Apply the adjustment factor to the sample weights
sample_data <- sample_data %>%
  left_join(adjustment_factors, by = "age") %>%
  mutate(raked_weight = weight * adjustment_factor)

# Old folks were sampled 2x as often, so they get downweighted (young ~2x their weight)
sample_data %>% group_by(age) %>% summarize(mean_weight = mean(raked_weight))
# A tibble: 2 × 2
  age   mean_weight
  <chr>       <dbl>
1 old         0.756
2 young       1.47 

Shake and Rake

Now race

# Not showing calculation, since they're identical.
sample_data %>% group_by(race) %>% summarize(mean_weight = mean(raked_weight))
# A tibble: 5 × 2
  race     mean_weight
  <chr>          <dbl>
1 asian          6.68 
2 black          1.87 
3 hispanic       4.22 
4 other          1.14 
5 white          0.731
# Notice: slightly further from universe than after first iter
sample_data %>% group_by(age) %>% summarize(mean_weight = mean(raked_weight))
# A tibble: 2 × 2
  age   mean_weight
  <chr>       <dbl>
1 old         0.741
2 young       1.50 

... and repeat until sample means on all marginals match population ones.
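The loop we just ran by hand can be sketched compactly in base R. This is a minimal illustration, not production code; `rake_weights` and the toy data are invented here, and real implementations (like autumn, below) handle convergence diagnostics, weight caps, and more.

```r
# Minimal raking (iterative proportional fitting) sketch in base R
rake_weights <- function(data, targets, max_iter = 100, tol = 1e-10) {
  w <- rep(1, nrow(data))
  for (iter in seq_len(max_iter)) {
    w_old <- w
    for (var in names(targets)) {
      # Weighted marginal proportions for this variable in the sample
      current <- tapply(w, data[[var]], sum) / sum(w)
      # Multiply each unit's weight by (target / current) for its category
      adj <- targets[[var]][names(current)] / current
      w <- w * adj[as.character(data[[var]])]
    }
    if (max(abs(w - w_old)) < tol) break  # stop once weights settle
  }
  unname(w) * nrow(data) / sum(w)  # rescale so weights sum to n
}

# Toy sample skewed old and white relative to 50/50 targets
d <- data.frame(age  = c("young", "young", "old", "old", "old"),
                race = c("white", "black", "white", "white", "black"))
tg <- list(age  = c(young = 0.5, old = 0.5),
           race = c(white = 0.5, black = 0.5))
w <- rake_weights(d, tg)
tapply(w, d$age, sum) / sum(w)  # matches the age target
```

Note the structure: each pass fixes one marginal exactly, slightly perturbing the others, and the perturbations shrink toward zero as the loop repeats.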

Less Manual: with autumn

autumn is a wonderful package for raking.

library(autumn)

target <- list(
  age = c(young = .5, old = .5),
  race = c(white = .6,
             hispanic = .1,
             black = .1,
             asian = .1,
             other = .1)
)
# See ?harvest for summary of other functionality
sample_data <- sample_data %>% harvest(target,max_weight = 10)

# With raked weights, we recover the population mean as expected
sample_data %>% summarize(mean_p = weighted.mean(enjoys_pineapple,weights))
# A tibble: 1 × 1
  mean_p
   <dbl>
1  0.467

Raking (Hard Mode)

Back to reality

Sadly, reality makes weighting much harder.

  • Phone response rates are now frequently less than 1%
  • Many survey researchers now use online or other non-probability sample methods
  • Outcomes are frequently related to propensity to respond in complicated ways
  • So, respondents are getting increasingly weird, in ways we can’t just weight out along a small set of census demographics

In other words, we live in a world of Non-Ignorable Nonresponse. What effects does this have on weighting?

A huge amount of pressure on weighting

  • Trying to weight to cover all the issues sampling can’t.
  • As an example, consider everything the NYT weighted their recent 2024 battleground state polls on.
  • Of course, we probably believe interactions of variables matter too…
  • Many of these variables are included because they seem correlated with constructs we’d struggle to quantify directly

Raking on everything isn’t free

  • In many cases, raking on everything we want to won’t even converge.
  • So we end up picking and choosing which variables and interactions theoretically matter most.

Second class of issues: variance problems

  • Raking on more dimensions can lower bias, but it often adds variance too
  • This leads to heuristic ways to control variance
  • Rules of thumb, like keeping \(deff_{kish} = n \sum_i w_i^2 / \left(\sum_i w_i\right)^2\) below 1.5
  • Trimming weights (cap/floor the tails of the weights distribution)
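Both heuristics are easy to compute; here's a quick sketch with made-up weights (`kish_deff` and `trim_weights` are hypothetical helper names):

```r
# Kish design effect: 1 + the squared coefficient of variation of the weights
kish_deff <- function(w) length(w) * sum(w^2) / sum(w)^2

# Made-up weights with a heavy right tail
w <- c(rep(1, 95), rep(8, 5))
kish_deff(w)  # well above the ~1.5 rule of thumb

# Trim: cap the weights, then rescale so they still sum to n
trim_weights <- function(w, cap) {
  w_t <- pmin(w, cap)
  w_t * length(w_t) / sum(w_t)
}
kish_deff(trim_weights(w, 5))  # lower: less variance, at the cost of some bias
```

Trimming trades a bit of bias (the capped units no longer fully represent their strata) for a meaningful variance reduction; that tradeoff is exactly what regularized raking makes explicit.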

Restating problems with raking on hard mode

Let’s restate the last few slides in terms of specific problems alternative weighting methods can solve:

  1. “Making better weights” feels incredibly heuristic with vanilla raking
    1. How do we select variables and interactions?
    2. How do we trade off bias for variance?
  2. Variables are either “in” or “out”; there’s no way to say, e.g., “race x education matters more to me than gender x race, but include both”
  3. It’s not clear in what sense the raking solution is optimal

Regularized Raking

Regularized Raking

(Barratt, Angeris, and Boyd 2021) show that the broader representative sample weights problem can be understood as an optimization problem:

\[ \begin{array}{ll} \operatorname{minimize} & \ell\left(f, f^{\mathrm{target}}\right)+\lambda r(w) \\ \text { subject to } & \quad w \geq 0, \quad \mathbf{1}^T w=1 \end{array} \]

Let’s step through this term by term.

Regularized Raking - Loss

\[ \begin{array}{ll} \operatorname{\color{grey}{minimize}} & \ell\left(f, f^{\mathrm{target}}\right)\color{grey}{+\lambda r(w)} \\ \color{grey}{ \text { subject to } } & \color{grey}{ \quad w \geq 0, \quad \mathbf{1}^T w=1} \end{array} \]

  • We have \(f\) and \(f^{target}\), which are functions of our sample and target population.
  • We’ll also specify some loss function \(\ell\), which defines in what sense we want \(f\) to be close to \(f^{target}\)

Regularized Raking

Loss examples - Ok, so what?

\[ \ell\left(f, f^{\mathrm{target}}\right)= \begin{cases}0 & f = f^{target} \\ +\infty & \text { otherwise }\end{cases} \]

  • In other words, \(f\) must match \(f^{target}\) exactly to be a valid solution
  • If \(f\) is a vector of all the marginal expected values for our raking dimensions, this is what vanilla raking requires!

Regularized Raking

Loss examples - Getting Interesting

\[ \ell\left(f, f^{\mathrm{target}}\right)= \begin{cases}0 & f^{\min } \leq f \leq f^{\max } \\ +\infty & \text { otherwise }\end{cases} \]

  • ... But why do we always want exact matching on every raking variable?
  • Allowing a range instead expands the realm of possible solutions!

Regularized Raking

Loss examples - Now We’re Getting Somewhere

  • Continuous losses are possible! For example, least squares: \(\ell(f, f^{target}) = \|f - f^{target}\|_2^2\)
  • Closer can be better, and further away is not infinitely bad!

Regularized Raking

Loss examples - full force

  • We can have different losses for different \(f\)!
  • Example:
    • require exact matching on single dimensions,
    • least squares on all interactions, and
    • different scaling such that different interactions are “worth” more according to theory or backtesting.

Regularized Raking

Regularization

\[ \begin{array}{ll} \operatorname{\color{grey}{minimize}} & \color{grey}{\ell\left(f, f^{\mathrm{target}}\right)+}\lambda r(w) \\ \color{grey}{ \text { subject to } } & \color{grey}{ \quad w \geq 0, \quad \mathbf{1}^T w=1} \end{array} \]

  • We don’t just care about adherence to population totals.
  • What about variance of weights, shape of distribution, etc?
  • We might want weights that are:
    • As uniform as possible
    • As close to some other target distribution with different variance properties
  • \(r(w)\) can be chosen from a variety of regularizers that encode such preferences
  • Also have a hyperparameter \(\lambda\), which allows us to make explicit tradeoffs between our loss term and our regularization term.

Regularized Raking

Regularization Example 1

\[ r(w) = \sum_{i=1}^{n} w_i \log w_i \]

  • This is the negative entropy, which equals (up to an additive constant) the Kullback–Leibler divergence from the weights distribution \(w\) to the uniform distribution.
  • This would express a desire that our weights be as uniform as possible, subject to all our other constraints
  • This is the other half of what vanilla raking does!
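A quick numerical check of the claim, with toy weight vectors (both summing to 1):

```r
# Negative entropy; smaller when the weights are closer to uniform
neg_entropy <- function(w) sum(w * log(w))

w_uniform <- rep(1/4, 4)
w_skewed  <- c(0.7, 0.1, 0.1, 0.1)
neg_entropy(w_uniform) < neg_entropy(w_skewed)  # TRUE: uniform minimizes it
# Note: sum(w * log(w)) equals KL(w || uniform) minus log(n)
```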

Regularized Raking

Regularization Example 2

\[ D_{\text{KL}}(w \parallel w^{des}) = \sum_{i=1}^{n} w_i \log\left(\frac{w_i}{w_i^{des}}\right) \]

  • Alternatively, we might want a weights distribution close to some target distribution \(w^{des}\), where closeness is defined in terms of KL divergence.
  • Motivations: minimizing extreme tails, more demand for smoothness
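A sketch, with a made-up target distribution (both vectors sum to 1):

```r
# KL divergence from weights w to a desired distribution w_des
kl_div <- function(w, w_des) sum(w * log(w / w_des))

w_des <- c(0.4, 0.3, 0.2, 0.1)       # hypothetical target distribution
w     <- c(0.25, 0.25, 0.25, 0.25)   # current weights
kl_div(w, w_des)      # > 0: penalizes distance from the target
kl_div(w_des, w_des)  # exactly 0 when the weights match the target
```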

Regularized Raking

Constraints

\[ \begin{array}{ll} \operatorname{\color{grey}{minimize}} & \color{grey}{\ell\left(f, f^{\mathrm{target}}\right)+\lambda r(w)} \\ \text { subject to } & \quad w \geq 0, \quad \mathbf{1}^T w=1 \end{array} \]

  • Non-negative weights are useful (not all estimators play well with negative weights)
  • We want weights to sum to 1 (equivalently: rescale to sum to n)

So what does vanilla raking do?

\[ \ell\left(f, f^{\mathrm{target}}\right)=\left\{\begin{array}{ll} 0 & f=f^{\mathrm{target}}, \\ +\infty & \text { otherwise }, \end{array} \quad r(w)=\sum_{i=1}^n w_i \log w_i\right. \]

  • Equality loss on all raking dimensions
  • Negative entropy regularizer
  • Hopefully, this gives some clarity about what raking optimizes for
  • As we’ve seen, this is far from the only choice of \(\ell\) and \(r(w)\); are other choices better?

Examples where RR helps

Data

Let’s get back into some code, and comparisons. I’ll be using a poll conducted by Pew in 2016, with 2074 likely voter respondents. The outcome of interest is vote margin: the share of voters who favored Clinton minus the share who favored Trump.

I like this dataset for a few reasons:

  • Easy access
  • There is a correct answer, and we know it (Hillary won the national vote by 2.2%)
  • If you do not weight by education, you will be wrong by like 10 points :)
  • Decent variety of covariates

We’ll treat estimates from the 2016 CCES as a target, because it’s huge, representative, and also easily accessible.

Introducing regrake

Let’s walk through a basic example in both survey (for vanilla raking) and regrake (our R implementation of regularized raking based on (Barratt, Angeris, and Boyd 2021)). We’ll weight just on age, gender, race, and region to show syntax.

Vanilla raking with survey

library(survey)

formula_1 <- ~recode_age_bucket + recode_female + recode_race + recode_region

# Convenience function to make targets for a formula given a `survey` design obj
target_1 <- create_targets(cces_wt_design, formula_1)

pew_raked_1 <- calibrate(design = pew_design,
                       formula = formula_1,
                       population = target_1,
                       calfun = "raking")

svycontrast(svymean(~recode_vote_2016, pew_raked_1, na.rm = TRUE),
                        vote_contrast)
           nlcon     SE
contrast 0.08901 0.0259

Now with regrake

# With regrake, we directly specify what variables to match
# and use the CCES survey design as our population target
regrake_result_1 <- regrake(
  data = pew,
  formula = ~ rr_exact(recode_age_bucket) + rr_exact(recode_female) +
              rr_exact(recode_race) + rr_exact(recode_region),
  population_data = cces_wt_design,
  pop_type = "survey_design",
  regularizer = "entropy",
  lambda = 1
)

Steps:

  1. Specify the formula with constraint types (rr_exact() for exact matching)
  2. Choose a regularizer (entropy = prefer uniform weights)
  3. Set the regularization strength (\(\lambda\))

Illustration: this is what raking is

n_pew <- nrow(pew)
regrake_weights_1 <- regrake_result_1$weights  # Already sums to n

regrake_design_1 <- svydesign(ids = ~1, data = pew, weights = regrake_weights_1,
                calibrate.formula = formula_1)

svycontrast(svymean(~recode_vote_2016, regrake_design_1, na.rm = TRUE),
                        vote_contrast)
            nlcon     SE
contrast 0.089006 0.0257

Same result as vanilla raking! With entropy regularizer and \(\lambda = 1\), we’re essentially doing the same optimization.

Illustration: weight distributions

Nearly identical weight distributions - as expected when we’re solving the same problem.

About as far as raking will take us

formula_2 <- ~ recode_age_bucket + recode_female +
    recode_region * recode_educ_3way * recode_race

# Convenience function to make targets for a formula given a `survey` design obj
target_2 <- create_targets(cces_wt_design,formula_2)

pew_raked_2 <- calibrate(design = pew_design,
                       formula = formula_2,
                       population = target_2,
                       calfun = "raking")

svycontrast(svymean(~recode_vote_2016, pew_raked_2, na.rm = TRUE),
                        vote_contrast)
            nlcon     SE
contrast 0.011029 0.0271

Something more complicated with regrake

Mixed loss types

With regularized raking, we can do something vanilla raking can’t: exact matching on main effects, but soft (L2) matching on interactions.

This lets us include more information without the convergence problems that come from trying to exactly match every cell.

Mixed loss types: the code

# Main effects: exact matching (rr_exact)
# Interactions: L2/soft matching (rr_l2)
regrake_result_2 <- regrake(
  data = pew,
  formula = ~ rr_exact(recode_age_bucket) + rr_exact(recode_female) +
              rr_exact(recode_inputstate) + rr_exact(recode_region) +
              rr_exact(recode_educ) + rr_exact(recode_race) +
              rr_l2(recode_region:recode_educ) +
              rr_l2(recode_region:recode_race) +
              rr_l2(recode_educ:recode_race) +
              rr_l2(recode_region:recode_educ:recode_race),
  population_data = cces_wt_design,
  pop_type = "survey_design",
  regularizer = "entropy",
  lambda = 10
)

51 states + 6 main effect vars + all 2/3-way interactions = lots of constraints!

Results: weight distributions

regrake produces smoother weights: fewer extreme values.

Results: vote margin estimate

regrake_design_2 <- svydesign(ids = ~1, data = pew, weights = regrake_weights_2,
                calibrate.formula = ~recode_age_bucket + recode_female +
     recode_region + recode_educ + recode_race)

svycontrast(svymean(~recode_vote_2016, regrake_design_2, na.rm = TRUE),
                        vote_contrast)
            nlcon     SE
contrast 0.020404 0.0282
Method                                    Vote Margin   True Margin
----------------------------------------  -----------   -----------
Unweighted                                ~5%           2.2%
Vanilla raking (main + interactions)      1.1%          2.2%
regrake (exact main + L2 interactions)    2.1%          2.2%

The soft constraints on interactions help us get closer to the truth!

Checking the balance: what did we actually achieve?

Unlike vanilla raking, soft constraints don’t guarantee exact balance. The balance data frame lets us inspect what regrake actually achieved:

Exact constraints (blue) cluster at 0; L2 constraints (red) have slack.

Regularized Raking in the Survey Weights Cinematic Universe

Versus other modern options

MRP (and friends!)

Multilevel Regression and Poststratification (Gelman et al. 2019) uses a regularized model to predict the outcome, then poststratifies.

RR Advantages

  • One set of weights regardless of outcome
  • Prior/training information can be used
  • Sometimes you actually want the whole survey posterior!

MRP Advantages

  • Fit directly on outcome(s) of interest - often more efficient
  • For most small area estimates, MRP is going to be more efficient
  • More generally, greater variety and flexibility of regularization

Versus other modern options

Multilevel Calibration

Multilevel Calibration (Ben-Michael, Feller, and Hartman 2023) also takes an optimization approach: it requires exact matching on the marginals, with more flexibility on interactions.

RR Advantages

  • Greater flexibility to define loss, regularization
  • Optimizes for explicit objective (don’t need to back out implicit prior)

MC Advantages

  • Nice theory connecting estimator to outcome model
  • Implementation-wise, probably more accessible as of today

I’ve only been talking about weighting today, so I want to emphasize:

  • Good sampling matters (and more than how you weight)
  • Selection of weighting variables matters (and more than how you weight)

That said, how you weight is entangled with these, and more flexible weighting gives you more options with the above.

Thank you!

Don’t hire me!

  • Currently happy with my job; Grow Progress is great.
  • Experience as a data science manager, data scientist, and campaign staffer.
  • keywords: Causal Inference, Bayes, HTE estimation, Surveys
  • (This is a bit; I was on the job market when I gave this talk originally)

Materials/Suggestions

  • (Barratt, Angeris, and Boyd 2021) for finer details of optimization behind talk today
  • For perspective on challenges of modern survey weighting, highly recommend A New Paradigm for Polling (Bailey 2023)
  • For how to think about picking targets and estimating them, I adore (Caughey et al. 2020), Target Estimation and Adjustment Weighting for Survey Nonresponse and Sampling Bias

References

Bailey, Michael A. 2023. “A New Paradigm for Polling.” Harvard Data Science Review 5 (3). https://doi.org/10.1162/99608f92.9898eede.
Barratt, Shane, Guillermo Angeris, and Stephen Boyd. 2021. “Optimal Representative Sample Weighting.” Statistics and Computing 31 (2): 19. https://doi.org/10.1007/s11222-021-10001-1.
Ben-Michael, Eli, Avi Feller, and Erin Hartman. 2023. “Multilevel Calibration Weighting for Survey Data.” Political Analysis, March, 1–19. https://doi.org/10.1017/pan.2023.9.
Caughey, Devin, Adam J. Berinsky, Sara Chatfield, Erin Hartman, Eric Schickler, and Jasjeet S. Sekhon. 2020. “Target Estimation and Adjustment Weighting for Survey Nonresponse and Sampling Bias.” Elements in Quantitative and Computational Methods for the Social Sciences, September. https://doi.org/10.1017/9781108879217.
Deville, Jean-Claude, Carl-Erik Särndal, and Olivier Sautory. 1993. “Generalized Raking Procedures in Survey Sampling.” Journal of the American Statistical Association 88 (423): 1013–20. https://doi.org/10.1080/01621459.1993.10476369.
Gelman, Andrew, Jeffrey Lax, Justin Phillips, Jonah Gabry, and Robert Trangucci. 2019. “Using Multilevel Regression and Poststratification to Estimate Dynamic Public Opinion.” Unpublished Manuscript, August, 48.
Teh, Yee Whye, and Max Welling. 2003. “On Improving the Efficiency of the Iterative Proportional Fitting Procedure.” In International Workshop on Artificial Intelligence and Statistics, 262–69. PMLR. https://proceedings.mlr.press/r4/teh03a.html.